Day 11: Building a Paginating Web Crawler with selenium | Kearch 1.0 Keyword-Report Crawler Tool

2018 iT 邦幫忙 Ironman Contest

DAY 10

Software Development

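[Marketing Needs Automation Too] Build a simple keyword-search report app with Python Selenium + NodeJS + Amazon EC2! Part 12 of the series

2018 Ironman Contest, selenium, pagination crawler, marketing tech geek, beautifulsoup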

Kyle

2017-12-27 00:54:35

Crawler articles in this series:

Crawling the Y Combinator Blog with Python scrapy
Simulating a website login with Python requests
Cracking dynamically loaded pages with Python requests and an API

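When to use it

selenium works for almost any crawling job, but it is comparatively slow, so it is best saved for the tricky cases.

When you run into "dynamic pagination" or "load more"

Google Search and most blogs give every page its own url. Some sites, however, load the next page dynamically, and some load more content automatically once you scroll to the bottom: new content is rendered only after the scroll offset from the top of the page reaches a certain value.
In these cases selenium is a great fit, because it can simulate what a user does on the page.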
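
As a rough sketch of the "scroll to load more" case, the helper below keeps scrolling to the bottom until the page height stops growing. The function name `scroll_until_loaded`, the pause length, and the round limit are assumptions made for illustration, not part of the original project; `driver` stands for any selenium WebDriver, such as the `webdriver.Chrome()` instance used later in this article.

```python
import time


def scroll_until_loaded(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        # Jump to the bottom so the site's "load more" logic fires.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude fixed wait for the new content to render
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, assume we reached the end
        last_height = new_height
        loaded += 1
    return loaded
```

A fixed `time.sleep` is the simplest option but not the most reliable one; a slow site may need a longer pause than the one assumed here.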

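Code logic

I like to lean on the logic to understand and remember code; it helps avoid memorizing usage by rote without being able to apply it flexibly.

Always start by importing what you need
Then assign the variables you will need later
Start our star of the show, the selenium webdriver, and tell it to use Chrome
Send the browser to the target site and perform some action
Parse the current html with BeautifulSoup, filter the elements you want, and print them or add them to a list
End the browser session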

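Implementation: code structure

You can use jupyter or the python interpreter (type "python" in a terminal to start it).
The advantage of jupyter is that results are saved right in the .ipynb, which suits anyone who is not yet familiar with the code and needs several tries.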

Start by importing what you need

from selenium import webdriver
from bs4 import BeautifulSoup

Assign the variables you will need later

# list that will hold all the elements selected later
ELE = []

Start the selenium webdriver and tell it to use Chrome

browser = webdriver.Chrome()
browser.get('https://anewstip.com/search/tweets/?q=AI+mobile&fsb=journalists#')

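Send the browser to the target site and perform some action. Here I want to change pages automatically, so I looked at the Next button at the bottom of the page. It is always labeled "Next", which makes it a good candidate for locating by its text with xpath:

from selenium.webdriver.common.by import By
browser.find_element(By.XPATH, '//div[@class="pages-select"]/a[contains(text(), "Next")]').click()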
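
One thing the single click does not cover is the last page. The sketch below assumes that the last page simply has no Next link (I have not verified this for the target site) and wraps the click in a hypothetical helper, `click_next`, that reports whether there was anything to click. It passes the locator strategy as the plain string `"xpath"`, which is the value behind `By.XPATH`.

```python
NEXT_XPATH = '//div[@class="pages-select"]/a[contains(text(), "Next")]'


def click_next(driver, xpath=NEXT_XPATH):
    """Click the Next link if it exists; return False when there is none."""
    # find_elements returns an empty list instead of raising when nothing matches.
    links = driver.find_elements("xpath", xpath)
    if not links:
        return False
    links[0].click()
    return True
```

In a paging loop, `if not click_next(browser): break` would then stop the crawl cleanly instead of crashing on the final page.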

After every page change, parse with BeautifulSoup first; only then can you filter the elements and print them

soup = BeautifulSoup(browser.page_source, 'html.parser')
for ele in soup.select('.info-name'):
    print(ele.text)
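
The snippet above only prints. To add the results to the list instead, something like the following could work. `collect_names` is a hypothetical helper, and stripping whitespace and skipping repeated names are my own assumptions about what a report would want; it accepts anything with a `.text` attribute, which includes the elements returned by `soup.select`.

```python
def collect_names(elements, names):
    """Append each element's stripped text to names, skipping blanks and repeats."""
    for ele in elements:
        name = ele.text.strip()
        if name and name not in names:
            names.append(name)
    return names
```

With the variables from this article it would be called as `collect_names(soup.select('.info-name'), ELE)` after each page is parsed.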

End the browser session

browser.close()
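
If the crawl raises an error halfway through, the closing line is never reached and the browser stays open. A minimal sketch of one way around that is a small context manager; `open_browser` is a hypothetical name, and `make_driver` is any zero-argument callable that returns a driver, for example `webdriver.Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def open_browser(make_driver):
    """Yield a driver and always shut it down, even if the crawl raises."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the whole session; close() only closes the current window.
        driver.quit()
```

Usage would look like `with open_browser(webdriver.Chrome) as browser:` followed by the crawling steps indented underneath.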

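Summary

Putting the snippets above together and adding a loop (because I want it to move to the next page automatically and keep crawling after it finishes each page of author names):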

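from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# list that collects every author name found
ELE = []

browser = webdriver.Chrome()
browser.get('https://anewstip.com/search/tweets/?q=AI+mobile&fsb=journalists#')

# change pages 10 times by default
for i in range(1, 11):
    # parse the page currently shown in the browser
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for ele in soup.select('.info-name'):
        print(ele.text)
        ELE.append(ele.text)

    # click the "Next" link to move to the following page
    browser.find_element(By.XPATH, '//div[@class="pages-select"]/a[contains(text(), "Next")]').click()

browser.close()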
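
One caveat with the loop: it reads `browser.page_source` right after clicking Next, so on a dynamically loaded page it may parse the old content again before the new page has rendered. As a sketch under that assumption, the hypothetical helper below polls until the page source differs from what it was before the click. Comparing the whole source is a crude signal (any changing ad or timestamp also counts as a change), so treat it as a starting point rather than a robust solution.

```python
import time


def wait_for_change(driver, old_source, timeout=10.0, poll=0.5):
    """Poll until driver.page_source differs from old_source.

    Returns True if the page changed before the timeout, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.page_source != old_source:
            return True
        time.sleep(poll)
    return False
```

Inside the loop this means saving `before = browser.page_source` ahead of the click and calling `wait_for_change(browser, before)` right after it. selenium also ships its own explicit-wait tools (`WebDriverWait` together with `expected_conditions`), which are worth a look once the basic flow works.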